AN UNBIASED STOP RULE FOR A RANDOM SAMPLE OF VARIABLE SIZE

Max A. Bershad and Walter M. Perkins

INTRODUCTION

When a predesignated number of sample units is to be selected from a given population, we know that standard simple random sampling procedures will give each sample unit an equal probability of being chosen for the sample, independent of the order of selection. A simple unbiased estimate of the mean of a population characteristic can be made by dividing the sample total by the number of sample units. If the number of sample units to be selected is not predesignated but becomes a variable under the sampling rules, we can no longer assume that the sample mean provides an unbiased estimator of this population mean. Let us suppose, for example, that the desired sample "size" is measured, not by the number of sample units but instead by the sum of a quantitative characteristic of the sample units; and that the measure varies from unit to unit, yet is never negative. A simple stop rule under these circumstances might specify: "Add sample units at random until the accumulated measure for the sample first equals or exceeds $\lambda$; then stop sampling." It is readily apparent that such a rule gives a large sample unit a somewhat greater probability of inclusion than a small sample unit and that the probability of the inclusion of a unit in the sample depends on the order of selection. Thus, when the accumulated measure for the sample is such that a large sample unit will cause the accumulation to reach the stopping point while a small sample unit will not, an ensuing order of "large unit-small unit" will include only the large unit, whereas an ensuing order of "small unit-large unit" will include both. Because the above simple stop rule gives some preference in sampling to large units as compared with smaller units, one expects the sample mean to be a biased estimator, as, in fact, it is.

It is possible to modify the simple stop rule in such a manner as to eliminate the sampling preference for large units.

The purpose of this memorandum is to provide an unbiased stop rule and to examine some properties of the rule including an estimator of the variance of the sample mean.

NOTATION 

We are given a population of N elements. Let $X_i$ represent a nonnegative "measure" of the i-th element in the universe, with

$$X_1 \le X_2 \le X_3 \le \cdots \le X_N.$$

Let $Y_i$ represent the value of the characteristic of interest for the i-th element. Our interest lies in estimating $\bar{Y}$.

Let

$$\{X_{i_1}, X_{i_2}, X_{i_3}, \ldots, X_{i_j}, \ldots, X_{i_N}\}$$

represent a random permutation of elements of the universe with $i_1$ denoting the label of the element in the first position (or drawn first); $i_2$ the label of the element in the second position, etc.

Let $n'(X_{i_1}, X_{i_2}, \ldots, X_{i_N})$, or more briefly $n'$, represent a variable sample size depending on the permutation of such nature that it will be true that

$$\sum_{j=1}^{n'} X_{i_j} \ge \lambda$$

where $\lambda$ is a preassigned constant. Y values will be available only for the $n'$ members of the sample. The stop rule, described below, to determine $n'$ is to be of such nature that the expected value,

$$E\left[\frac{1}{n'} \sum_{j=1}^{n'} Y_{i_j}\right],$$

is to equal $\bar{Y}$ over all possible permutations.



PERKINS'  STOP  RULE  FOR  SAMPLE  SIZE 

First select a random permutation of the elements (that is, of course, normally done by selecting elements one at a time).

For the selected permutation, find the smallest $\alpha$ for which

$$\sum_{j=1}^{\alpha} X_{i_j} \ge \lambda.$$

Denoting the largest of the $\alpha$ cases as

$$L_{j=1}^{\alpha}\left(X_{i_j}\right),$$

find the largest number of additional elements, $\beta$, such that

$$\sum_{j=1}^{\alpha} X_{i_j} - L_{j=1}^{\alpha}\left(X_{i_j}\right) + \sum_{j=\alpha+1}^{\alpha+\beta} X_{i_j} < \lambda.$$

Then the sample size is

$$\alpha + \beta = n';$$

these first $n'$ elements of the permutation are in the sample; and

$$\frac{1}{n'} \sum_{j=1}^{n'} Y_{i_j}$$

will be an unbiased estimator of $\bar{Y}$.
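The rule can be tried out on a toy population. The sketch below is only an illustration, not part of the memorandum: the function names, the three-element population and the choice $\lambda = 3$ are all assumptions made for the example, and taking the whole population when $\lambda$ is never reached is likewise an assumed convention. It enumerates every ordering exactly, once under the simple rule of the Introduction and once under the rule just described.

```python
from fractions import Fraction
from itertools import permutations


def simple_stop(x, lam):
    """Naive rule: stop at the first draw where the running total reaches lam."""
    total = 0
    for k, v in enumerate(x, start=1):
        total += v
        if total >= lam:
            return k
    return len(x)  # assumption: take everything if lam is never reached


def perkins_stop(x, lam):
    """Rule of this section: returns n' = alpha + beta for measures x in draw order."""
    alpha = simple_stop(x, lam)
    # Total of the first alpha draws with the largest of them removed.
    reduced = sum(x[:alpha]) - max(x[:alpha])
    beta = 0
    for v in x[alpha:]:
        if reduced + v >= lam:  # adding this element would reach lam: stop
            break
        reduced += v
        beta += 1
    return alpha + beta


def expected_sample_mean(X, Y, lam, rule):
    """Average the sample mean over all N! equally likely orderings."""
    means = []
    for perm in permutations(range(len(X))):
        n = rule([X[i] for i in perm], lam)
        means.append(Fraction(sum(Y[i] for i in perm[:n]), n))
    return sum(means) / len(means)


# Hypothetical population: the characteristic equals the measure.
X = [1, 2, 3]
Y = [1, 2, 3]
lam = 3
naive = expected_sample_mean(X, Y, lam, simple_stop)
perkins = expected_sample_mean(X, Y, lam, perkins_stop)
print(naive, perkins, Fraction(sum(Y), len(Y)))
```

For this population the simple rule averages to 9/4 while the population mean is 2, which is the upward bias toward large units described in the Introduction; the modified rule averages to exactly 2.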

PROOF  OF  UNBIASEDNESS 

Regard the actual realization

$$\{X_{i_1}, X_{i_2}, X_{i_3}, \ldots, X_{i_\alpha}, X_{i_{\alpha+1}}, \ldots, X_{i_{n'}}, X_{i_{n'+1}}\}.$$

We will prove that there are $n'!$ equivalent realizations, derived from permuting the first $n'$ elements.

First relabel the largest of the first $\alpha$ as $X_L$. This is equal to or larger than any other $X_{i_j}$, $j = 1, 2, \ldots, n'$.

Omitting $X_L$ and $X_{i_{n'+1}}$, relabel the balance as

$$X_{k_1}, X_{k_2}, \ldots, X_{k_{n'-1}}$$

where $X_{k_1}$ is the smallest of the balance; $X_{k_2}$ is the next smallest; etc. to $X_{k_{n'-1}}$, the largest of the balance.

The permutation which would have the largest sample size by the rule is

$$\{X_{k_1}, X_{k_2}, X_{k_3}, \ldots, X_{k_{n'-1}}, X_L, X_{i_{n'+1}}\}$$

and the permutation which would have the smallest sample size by the rule is

$$\{X_L, X_{k_{n'-1}}, \ldots, X_{k_2}, X_{k_1}, X_{i_{n'+1}}\}.$$

The application of the stop rule to the first of the permutations results in $\alpha = n'$, $\beta = 0$ so that $n'$ is unchanged. The application of the stop rule to the second also leads, as is clear, to a sample size of $n'$ (even though $\alpha$ here may not be the same as for the original realization). Since the shortest and longest possible sample size is $n'$, every permutation of the $n'$ elements followed by $X_{i_{n'+1}}$ will, by the rule, have a stop at $n'$. (Note: If $n' = N$ in one realization, it is obvious that $n'$ is always equal to N.)

The value of the first element of the actual realization, $Y_{i_1}$, is obviously an unbiased estimator of $\bar{Y}$. The element in the first position of each of the permutations is also an unbiased estimator of $\bar{Y}$. So that we have $n'$ equally likely estimators, the expected value of each being $\bar{Y}$; thus, making

$$\bar{y} = \frac{1}{n'} \sum_{j=1}^{n'} Y_{i_j}$$

an unbiased estimator of $\bar{Y}$.

In brief the proof can be summarized as follows:

$$\bar{Y} = E\,Y_{i_1} = E_1 E_2 Y_{i_1}$$

where E is the expected value over all trials; $E_2$ is the expected value over all permutations for a fixed combination of elements in the sample; and $E_1$ is the expected value over combinations; $Y_{i_1}$ is the first element of the sample selected.

Over all permutations for a fixed sample combination

$$E_2 Y_{i_1} = \frac{Y_{i_1} + Y_{i_2} + \cdots + Y_{i_{n'}}}{n'}.$$

This leads to the fact that

$$\bar{Y} = E_1\left[\frac{Y_{i_1} + Y_{i_2} + Y_{i_3} + \cdots + Y_{i_{n'}}}{n'}\right] = E_1 E_2\left[\frac{Y_{i_1} + Y_{i_2} + \cdots + Y_{i_{n'}}}{n'}\right] = E\left[\frac{Y_{i_1} + Y_{i_2} + \cdots + Y_{i_{n'}}}{n'}\right] = E\bar{y}.$$

Smaller sample sizes than $n'$ are also available but as will be seen (and it is obvious) the variance of such an estimator is worse than using all $n'$. Thus

$$\frac{1}{m} \sum_{j=1}^{m} Y_{i_j}$$

is an unbiased estimator where the first m elements in the realization are used when $m < n'$, otherwise $n'$ is used. And of casual interest is the unbiased estimator

[equation not legible in the source; it involves $\sum Y_{i_j}$ and a difference of two Y values]

which can be used if one obtains the observation $Y_{i_{n'+1}}$. (The proof follows lines similar to those already given and involves the interchange of $X_L$ and $X_{i_{n'+1}}$.)

The unbiasedness of the estimator under the given stop rule has also been proved by induction. This proof does not however get at the essence of the matter which is the invariance of the estimator by the rule under the different permutations.

ESTIMATOR OF THE VARIANCE OF $\bar{y}$

An unbiased estimator is possible if $n' \ge 2$ for every realization (i.e., not more than one $X_i$ in the universe exceeds or is equal to $\lambda$).

By definition

$$\sigma^2_{\bar{y}} = E(\bar{y}^2) - \bar{Y}^2.$$

Therefore

$$\sigma^2_{\bar{y}} = E(\bar{y}^2) - \left[\frac{1}{N}\left(\frac{\sum_{i}^{N} Y_i^2}{N}\right) + \frac{N-1}{N}\left(\frac{\sum_{i \ne j}^{N(N-1)} Y_i Y_j}{N(N-1)}\right)\right].$$

Since $\bar{y}^2$ is an estimator of $E(\bar{y}^2)$ and

$$\sum_{j}^{n'} Y_{i_j}^2 / n'$$

is an estimator of

$$\sum_{i}^{N} Y_i^2 / N,$$

we now only need an estimator for the last term.

By merely regarding the first two positions in every one of the permutations of our realization and averaging them, or alternatively and equivalently taking all combinations of two from the realization in hand, we have as the desired estimator of the last term

$$\frac{1}{n'(n'-1)} \sum_{j \ne k}^{n'(n'-1)} Y_{i_j} Y_{i_k}.$$

Substituting the indicated estimators, and simplifying the result, leads to the estimator

$$s^2_{\bar{y}} = (1 - f')\,\frac{s^2}{n'}$$

where

$$f' = n'/N$$

and

$$s^2 = \frac{\sum_{j=1}^{n'} (Y_{i_j} - \bar{y})^2}{n'-1}.$$

The variance

$$\sigma^2_{\bar{y}} = E\left[(1 - f')\,\frac{s^2}{n'}\right].$$

In the special case where every $X_i = 1$, the development here is the same as simple random sampling without replacement; $n'$ is a constant for all realizations; the estimator of the population variance is the same; but the population variance becomes easy to write without the expected value operator since $s^2$ is now the only random variable in $\sigma^2_{\bar{y}}$.

A SPECIAL CASE

Another special case is that in which each $X_i$ is either 1 or 0 (depending on whether or not the element is defective, although in this case one would normally have to pay for determining the value of $X_i$). In this special case the sample would, by the stop rule, consist of all elements preceding the $(\lambda+1)$-st defective and would lead to the same unbiased estimate of the percent defectives as appears in the literature. Because of the generality of the stop rule of the memorandum, the inspection scheme could be stopped when, say, a certain dollar value defective number was reached; and an estimate made of the total dollar value which is defective (when the total number of items or elements is known; these items, in the general case, having unequal values).

THE COMPLEMENT OF THE SAMPLE

In simple random sampling of n elements without replacement, the nonselected elements also form a simple random sample. So that if y/n is the estimate of the mean, (Y-y)/(N-n) derived from the complement of the sample is also an unbiased estimate of the mean.

For sampling under the stop rule given here, it can be proved that from the complementary sample

$$\bar{y}' = \frac{Y - y - Y_{i_{n'+1}}}{N - n' - 1}$$

is an unbiased estimator of the mean when the largest possible $n'$ is N-2. If the largest possible $n'$ is N-1, the estimator is given by $Y_{i_{n'+1}}$ when $n' = N-1$ and by $\bar{y}'$, as given above, when $n' < N-1$. If $n' = N$, we always have a 100 percent sample and there is no complementary group.

The characteristic of the complementary sample is that the sum of the measures is always less than some predesignated constant, which suggests that there is a complementary stop rule that can be applied sequentially. In point of fact there is: namely, stop after the $(\alpha-1)$-st draw and use the estimator

$$\frac{\sum_{j=1}^{\alpha-1} Y_{i_j}}{\alpha-1}$$

when $(\alpha-1) \ge 1$, and $Y_{i_1}$ when $(\alpha-1) = 0$.
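The variance estimator of this memorandum, $(1 - f')\,s^2/n'$ with $f' = n'/N$, can also be checked by brute force. The following is a minimal sketch under assumed inputs: the four-element population, the value $\lambda = 4$ and all function names are hypothetical, chosen so that only one measure reaches $\lambda$ and hence $n' \ge 2$ in every ordering. It compares the exact variance of the sample mean over all orderings with the exact expectation of the estimator.

```python
from fractions import Fraction
from itertools import permutations


def perkins_stop(x, lam):
    """Sample size n' under the stop rule, for measures x in draw order."""
    total, alpha = 0, len(x)  # assumption: n' = N if lam is never reached
    for k, v in enumerate(x, start=1):
        total += v
        if total >= lam:
            alpha = k
            break
    reduced = sum(x[:alpha]) - max(x[:alpha])
    n = alpha
    for v in x[alpha:]:
        if reduced + v >= lam:
            break
        reduced += v
        n += 1
    return n


def check_variance_estimator(X, Y, lam):
    """Return (true variance of the sample mean, expected value of the estimator)."""
    N = len(X)
    means, estimates = [], []
    for perm in permutations(range(N)):
        n = perkins_stop([X[i] for i in perm], lam)
        ys = [Fraction(Y[i]) for i in perm[:n]]
        ybar = sum(ys) / n
        s2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)  # needs n' >= 2
        means.append(ybar)
        estimates.append((1 - Fraction(n, N)) * s2 / n)
    e_mean = sum(means) / len(means)
    true_var = sum((m - e_mean) ** 2 for m in means) / len(means)
    return true_var, sum(estimates) / len(estimates)


true_var, expected_estimate = check_variance_estimator([1, 2, 3, 4], [5, 1, 7, 2], 4)
print(true_var, expected_estimate)
```

Exact rational arithmetic is used so that the two quantities can be compared for equality rather than to within rounding.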


USE  OF  PERKINS'  STOP  RULE  WHEN  SAMPLING  WITH  REPLACEMENT 

Max  A.  Bershad 

INTRODUCTION

The unbiased nature of the estimator of $\bar{Y}$ under the stop rule when X is a population characteristic of every element, $e_i$, of the universe and when sampling was done without replacement was considered in the preceding memorandum.

The purpose of this memorandum is to point out that when sampling the elements with replacement with probability proportionate to size (p.p.s.), the stop rule, by proper assignment of $X_{i_j}$ values, again produces unbiased estimates.

FIRST METHOD OF ASSIGNMENT

If one sets the measure cumulated for the j-th draw equal to $X_{i_j}$ the first time the element $i_j$ is selected and sets it equal to 0 when the element has been previously selected; and if one now applies the stop rule to the cumulation of these measures, as in the previous memorandum, then

$$\frac{1}{n'} \sum_{j=1}^{n'} Y_{i_j}$$

is an unbiased estimator of $\bar{Y}$ if equal probability sampling has been used; and

$$\frac{1}{n'} \sum_{j=1}^{n'} \frac{1}{N}\,\frac{Y_{i_j}}{p_{i_j}}$$

is an unbiased estimator of $\bar{Y}$ if unequal probabilities have been used; $p_k$ being the probability of selection of the element $e_k$ whenever a draw is made.

Sampford¹ in Biometrika, 1962, treats the case of unequal probability sampling with replacement with

$$X_{i_j} = 1$$

the first time element $i_j$ is selected and

$$X_{i_j} = 0$$

if it has been previously selected.

OTHER METHODS OF ASSIGNMENT

Other alternatives are available for the measures to be cumulated without counting as zeros the reappearance of a previously selected element. Thus

$$X_{i_j}, \; \ldots, \; (A + BX_{i_j}),$$

etc., can be cumulated for application of the stop rule. The estimator of the mean would be unchanged.

Or any of the above can be used the first time the sample element is drawn, with 0 used for cumulation on subsequent selection of the element, as in the first method of assignment.

That the estimates given above are unbiased by the stop rule is an immediate consequence of the fact that all the permutations of the $n'$ selections, followed by $X_{i_{n'+1}}$, have the same stopping point and are equally likely. Consequently one only needs to write an unbiased estimator utilizing the first draw, namely

$$\frac{1}{N}\,\frac{Y_{i_1}}{p_{i_1}},$$

and average over all $n'$ equally likely first draws.

THE ESTIMATOR OF THE VARIANCE

For all methods of assignment mentioned in this memorandum, when $n' \ge 2$ for every realization, the estimator of the variance is

$$s^2_{\bar{y}} = \frac{s^2}{n'}$$

where

$$s^2 = \frac{\sum_{j=1}^{n'} (y_{i_j} - \bar{y})^2}{n'-1}$$

and where

$$y_{i_j} = \frac{1}{N}\,\frac{Y_{i_j}}{p_{i_j}},$$

namely the estimator one would use if one only utilized the element appearing in the first position of a permutation of the actual realization.

This estimator is analogous to the estimator for without-replacement sampling without the variable finite multiplier, $(1-f')$.

The proof follows from the fact that in

$$\sigma^2_{\bar{y}} = E\bar{y}^2 - \bar{Y}^2,$$

the first term is estimated by $\bar{y}^2$ and $\bar{Y}^2$ is estimated by

$$\frac{\sum_{j \ne k}^{n'(n'-1)} y_{i_j}\, y_{i_k}}{n'(n'-1)}.$$

¹ Sampford, M.R. "Methods of Cluster Sampling With and Without Replacement for Clusters of Unequal Sizes." Biometrika, Vol. 49, Nos. 1-2, 1962, pp. 27-40.
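The argument of this memorandum rests on the first draw alone giving an unbiased estimate. As a brief worked check, assuming only that element $e_k$ is drawn with probability $p_k$ on every draw and that the $p_k$ sum to one, the expectation of the single-draw estimator is:

```latex
E\left[\frac{1}{N}\,\frac{Y_{i_1}}{p_{i_1}}\right]
  = \sum_{k=1}^{N} p_k \cdot \frac{1}{N}\,\frac{Y_k}{p_k}
  = \frac{1}{N}\sum_{k=1}^{N} Y_k
  = \bar{Y}.
```

With equal probabilities, $p_k = 1/N$ and the estimator reduces to $Y_{i_1}$ itself, which is the equal-probability case treated first.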


THE  VARIANCE  OF  THE  REPLICATION  METHOD  FOR  ESTIMATING  VARIANCES, 

FOR  THE  CPS  SAMPLE  DESIGN 

Margaret  Gurney 


A.  Summary 

With a number of assumptions, a formula is developed for the increase in the variance (over the grouped-strata variance) of the CPS sample due to the use of replications as a method of estimating variances. Using Equation (17), a table of the relvariance of the estimated variance for different numbers of replications is computed.

Further discussion leads to the conclusion that 20 replications will produce (for most items) variance estimates which have coefficients of variation of .33 or less. With 40 replications, the coefficient of variation of a variance estimate will be of the order of .25, or less.

B.  Background 

1. Before the introduction of the composite estimate, the component of variance for an estimate from the Current Population Survey, coming from the non-self-representing (NSR) Primary Sampling Units (PSUs), was estimated by the "Grouped Stratum Method." In this method, the NSR PSUs are grouped into G groups of $L_g$ strata each (usually $L_g = 2$), and an estimate of the variance of each group is obtained by treating each of the $L_g$ estimated-PSU-totals as if it were drawn with replacement from the combined group of strata.

Within the self-representing (SR) PSUs the sample segments were assigned to random groups, and the group totals were used to estimate this component of the variance.

2. The composite estimate crosses over strata lines, and involves SR as well as NSR PSUs; hence the grouped strata method is not literally applicable for estimating the variance. Indeed, this trouble was present even before the use of the composite estimate, since the second-stage ratio estimate adjusts to an age-sex distribution that is independent of the CPS stratification.

3. The replication method for estimating variances gets back to first principles: for each replication an estimate is made for the characteristic under consideration, using all of the steps of the usual CPS procedure, including the composite estimate; the variance between these estimates is then computed. This is exactly what we mean by the variance of an estimate from a sample.

4. In this memorandum we consider only the part of the variance coming from the NSR PSUs in the sample. For the SR PSUs, sample segments are grouped into pseudo-PSUs, and these are treated similarly.

C. Non-Self-Representing PSUs

The contribution to the variance from the NSR strata of the CPS sample can be estimated as follows:

1. Group the NSR strata into G groups of $L_g$ strata each,

2. Make $L_g$ estimates of the group total for the characteristic being estimated for each group of strata: use the sample segments in each sample PSU to make an estimate for the group of strata to which it has been assigned,

3. Select one of the $L_g$ estimates at random from each of the G groups of PSUs,

4. Combine the G estimates so obtained to get an over-all estimate for all of the NSR strata; this estimate may involve the age-sex-color adjustment, a composite estimate, or any other refinement which we wish to employ.

The process up to this point is one replication; it produces one sample estimate (from a random half-sample if $L_g = 2$) of the characteristic under consideration, which is computed in the same manner as the CPS estimate is computed from the whole sample.

5. Repeat steps (3) and (4) R times, for R random selections from each group.

We now have R estimates of the characteristic, from R replications.

6. Use the R estimates so obtained to estimate the sampling variance of the estimate of the characteristic under consideration.

The estimate of the variance, using this procedure, will become more reliable with increasing values of R; its limit, as $R \to \infty$, is the usual grouped-strata estimate of the variance.

D.   Notation  and  Assumptions 

For the replication method we notice that each replication produces an estimate which is based on one-half of the data. The R replications use R different random halves of the whole body of data available for the CPS estimate.

We also have the CPS estimate, based on all data, which is produced in processing steps parallel to those used for each replication. Each replication estimate is assumed to be an unbiased estimate of the CPS estimate.

Let $x'_{CPS}$ be the CPS estimate of the total, from the whole sample, and assume that it is of the form

$$x'_{CPS} = \sum_{g}^{G} x'_g = \sum_{g}^{G} \sum_{h}^{L_g} x'_{gh} = \sum_{g}^{G} \sum_{h}^{L_g} x'_{gh(i)} \qquad (1)$$

where $x'_{gh(i)}$ is the estimated stratum total for the h-th stratum in the g-th group, based on the i-th PSU, which is the sample PSU for this stratum. We assume (for purposes of this memorandum) that the effects of the second stage ratio adjustment, and of the composite estimate, will not seriously affect the stratification of the CPS design.

Let $x'_\alpha$ be the estimate for the $\alpha$-th replication:

$$x'_\alpha = \sum_{g}^{G} L_g\, x'_{g\alpha} \qquad (2)$$

where $x'_{g\alpha}$ is the estimated stratum total for the stratum in the g-th group which is selected at the $\alpha$-th replication.

The conditional expected value, over replications, is

$$E_\alpha\, x'_\alpha = \sum_{g}^{G} L_g\, E_\alpha\, x'_{g\alpha} = \sum_{g}^{G} \sum_{h}^{L_g} x'_{gh} = x'_{CPS}$$

and over all possible samples:

$$E\, x'_\alpha = X$$

where X is the population total to be estimated.

E.  The  Variance  Estimate 

Suppose that each group contains the same number of strata (L), and that the measures of size ($A_{gh}$, h = 1, 2, ..., L) of the L strata which form the g-th group are approximately equal, so that $A_{gh}/A_g \approx 1/L$. Then the usual grouped-strata estimate¹ of the variance of the estimated total ($x'_{CPS}$) becomes

$$s^2_{GR}(x'_{CPS}) = \sum_{g}^{G} \frac{L}{L-1} \sum_{h}^{L} (x'_{gh} - \bar{x}'_g)^2 \qquad (3)$$

where

$$\bar{x}'_g = \frac{1}{L} \sum_{h}^{L} x'_{gh}.$$

We have

$$E\, s^2_{GR} = \sigma^2(x'_{CPS}) + \frac{L}{L-1} \sum_{g}^{G} \sum_{h}^{L} (X_{gh} - \bar{X}_g)^2 \qquad (4)$$

where

$$\sigma^2(x'_{CPS}) = \sum_{g}^{G} \sum_{h}^{L} \sigma^2(x'_{gh}) = \sum_{g}^{G} \sum_{h}^{L} \sigma^2_{gh}$$

and $\sigma^2_{gh}$ is the variance within the gh-th stratum.

If the strata in each group are about equal in size for the characteristic being measured, so that $X_{gh} - \bar{X}_g \approx 0$, the second term in Eq. (4) will disappear, and the estimate will be approximately unbiased.

¹ Hansen, Hurwitz and Madow, Sample Survey Methods and Theory, Vol. II, p. 218, Eq. (5.2).

For the replication estimate we define

$$s^2_R = \frac{1}{L-1}\,\frac{1}{R} \sum_{\alpha}^{R} (x'_\alpha - x'_{CPS})^2, \qquad (5)$$

where $x'_\alpha$ is given in Eq. (2).

The conditional expected value of $s^2_R$, over all possible replications, is $s^2_{GR}$, as can be readily seen; the expected value over all samples is $\sigma^2(x'_{CPS})$, assuming the bias from grouping of strata to be negligible.
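The statement that the replication estimate averages, over replications, to the grouped-strata estimate can be checked on made-up numbers. The sketch below is illustrative only: the stratum estimates and function names are hypothetical, every group is assumed to have the same number L of strata, and the grouped-strata estimate is taken to be the sum over groups of L/(L-1) times the sum of squared deviations of the stratum estimates from their group mean. It enumerates every possible selection of one stratum per group instead of drawing R at random.

```python
from fractions import Fraction
from itertools import product


def grouped_strata_variance(groups):
    """Sum over groups of L/(L-1) * sum_h (x_gh - group mean)^2."""
    total = Fraction(0)
    for g in groups:
        L = len(g)
        mean_g = Fraction(sum(g), L)
        total += Fraction(L, L - 1) * sum((x - mean_g) ** 2 for x in g)
    return total


def replicate_estimate(groups, picks):
    """One replication: inflate the selected stratum estimate in each group by L."""
    return sum(len(g) * g[h] for g, h in zip(groups, picks))


def mean_replication_variance(groups):
    """Average of (x_alpha - x_cps)^2 / (L - 1) over every possible replication."""
    L = len(groups[0])  # assumes every group has the same number of strata
    x_cps = sum(sum(g) for g in groups)
    all_picks = list(product(range(L), repeat=len(groups)))
    squares = [Fraction(replicate_estimate(groups, p) - x_cps) ** 2 for p in all_picks]
    return sum(squares) / len(squares) / (L - 1)


# Hypothetical estimated stratum totals: three groups of two strata each.
half_sample_groups = [[10, 14], [7, 3], [20, 25]]
print(grouped_strata_variance(half_sample_groups),
      mean_replication_variance(half_sample_groups))
```

In practice only R randomly chosen replications are used, so the replication estimate scatters around this average; that extra scatter is the increase in variance studied in the following section.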

F.  The  Variance  of  the  Estimate  of  the  Variance 

We wish now to evaluate

$$\sigma^2(s^2_R) = E(s^4_R) - \sigma^4(x') = E(s^4_R - s^4_{GR}) + \sigma^2(s^2_{GR}). \qquad (6)$$

1. The conditional expected value of $s^4_R$ over all replications can be shown to be (the algebra is straightforward, but tedious)

a 


t2       i       G  L 
-±—     hi  E(x'    -x   )' 

(L-l)2*       gh     gh     « 


G    rL 
-  E     Z(x»    -x   ) 

gLh     gh     g' 


G     L  L 

+   2  E     E(x*    -x   )2  Z(x' .-x   )2 


^ch     gh     * 


cj 


(7) 


2.  Further  algebra  leads  to  the  expected 
value  over  all  samples  of  PSUs  which  could  have 
been  selected  for  the  CPS: 

G  L 


E(s 


4x1    f(L-2)(L-3) 

n_D  /  u    "1         JIT     11 


(L-2)(L-6) 


R     GR' 

G     L 


E  E  U-i 


LlL^-—  %:gh 


G     L 


E     E 


o2a2 .  +  2     E     E   a2,    E   a2 . 
(L-l)2       g  h/j     gh  gJ  g^c  h     gh   j      °J 


(8) 


Here $\mu_{4:gh}$ is the expected value of the fourth moment of the sample estimate in the gh-th stratum (based on a sample of segments in the i-th PSU) around the stratum total:

$$\mu_{4:gh} = E(x'_{gh} - X_{gh})^4.$$

If $P_{ghi}$ is the probability of selecting the i-th PSU in the gh-th stratum, and $x'_{ghi}$ is the estimated total for the i-th PSU, estimated from the sample of segments in that PSU, then

$$x'_{gh} = x'_{ghi} / P_{ghi}$$

and

$$\mu_{4:gh} = E(x'_{ghi}/P_{ghi} - X_{gh})^4 = E(x'_{ghi} - X_{ghi})^4 / P^4_{ghi} + E(X_{ghi}/P_{ghi} - X_{gh})^4 + \frac{6}{P^2_{ghi}}\, E(x'_{ghi} - X_{ghi})^2 \cdot E(X_{ghi}/P_{ghi} - X_{gh})^2. \qquad (9)$$

That is, $\mu_{4:gh}$ involves

a) The fourth moment of the sample within the i-th PSU about the PSU total,

b) The fourth moment of PSU totals as estimates of the stratum total,

c) The expected value of the product of the variance within the i-th PSU by the variance between PSUs.

Similarly, within the gh-th stratum,

$$\sigma^2_{gh} = E(x'_{ghi}/P_{ghi} - X_{gh})^2 = E(x'_{ghi} - X_{ghi})^2 / P^2_{ghi} + E(X_{ghi}/P_{ghi} - X_{gh})^2 \qquad (10)$$

or

$$\sigma^2_{gh} = \sigma^2_{W:gh} + \sigma^2_{B:gh} \qquad (11)$$

where

$$\sigma^2_{W:gh} = E_i\, \sigma^2(x'_{ghi}) / P^2_{ghi} = E(x'_{ghi} - X_{ghi})^2 / P^2_{ghi}$$

is the average within-PSU-variance of the estimated PSU total, and

$$\sigma^2_{B:gh} = \sigma^2(X_{ghi}/P_{ghi}) = E(X_{ghi}/P_{ghi} - X_{gh})^2$$

is the between-PSU-variance of the estimated stratum total.

Eq. (9) may be written as

$$\mu_{4:gh} = E_i\, \mu_4(x'_{ghi}) / P^4_{ghi} + 6\sigma^2_{B:gh}\, E_i\, \sigma^2(x'_{ghi}) / P^2_{ghi} + E_i\, \mu_4(X_{ghi}/P_{ghi}) = \mu_{4B:gh} + 6\sigma^2_{B:gh}\,\sigma^2_{W:gh} + \mu_{4W:gh}. \qquad (12)$$

3. In order to evaluate the terms in Eq. (8) we assume that the $\{\mu_{4:gh}\}$ and $\{\sigma^2_{gh}\}$ are approximately constant from stratum to stratum; i.e.:

$$\mu_{4:gh} = \mu_4, \qquad \sigma^2_{gh} = \sigma^2. \qquad (13)$$

We define

$$\beta = \mu_4 / \sigma^4 \qquad (14)$$

and substitute Eq. (13) and (14) into Eq. (8):

Ws4  «,4  \  GL  J(L-2)(L-3)   * 

^VW      "     T   I     L(L-l)        % 

.  g4[(L-2)U-6)   +  2(G,l)L]j        ■ 


GL  c4f(L-2)(L-3)   £       (L-2)(L-6) 


UL-1) 


L-l 


+  2(G-l)L 


(15) 


4. Making the same assumptions, the second term of Eq. (6) becomes

$$\sigma^2(s^2_{GR}) = GL\sigma^4 \left[\beta - \frac{L-3}{L-1}\right]. \qquad (16)$$

Substitution of Eqs. (15) and (16) into Eq. (6) expresses $\sigma^2(s^2_R)$ in terms of $\beta$ and $\sigma^4$.

5. To find the relvariance of $s^2_R$, use

$$\sigma^2(x'_{CPS}) = GL\sigma^2$$


to  find 


P  - 


L-3 


-%|ikfl+2M}.      (17) 

The term in braces, in Eq. (17), is the increase in variance due to replications. As the number of replications (R) increases, the contribution from this term decreases, and for R very large, $V^2(s^2_R)$ approaches the relvariance of the grouped strata estimate, which is given by the first term of Eq. (17).

G.  Magnitude  of  the  Increase  due  to  Replications 

Table 1 shows the relvariance of the estimated variance for several values of $\beta$ and for different numbers of replications.

The computations use LG = 170, the number of NSR strata in the 230 area CPS sample, which was used in the middle 1950's; the number of strata grouped together is L = 2, so that formula (17) becomes

$$V^2(s^2_R) = \frac{\beta + 1}{GL} + \frac{2}{R} \cdot \frac{G-1}{G}.$$

The last line of Table 1 (for $R \to \infty$) gives the relvariance for the grouped strata estimate of the variance: subject, of course, to all of the assumptions made in the derivation here.

Table 1. RELVARIANCE OF THE ESTIMATED VARIANCE FOR SPECIFIED NUMBERS OF REPLICATIONS, AS A FUNCTION OF β: L = 2, G = 85.

Number of        Relvariance of the variance when β is assumed to be
replications         3          6          10         20

      1            2.000      2.018      2.041      2.100
      2            1.012      1.030      1.053      1.112
      4            0.518      0.535      0.559      0.618
     10            0.221      0.239      0.262      0.321
     20            0.122      0.140      0.163      0.222
     40            0.073      0.091      0.114      0.173
      ∞            0.024      0.041      0.065      0.124
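The entries of Table 1 can be recomputed as a check. The sketch below assumes that the L = 2 form of the formula is $(\beta + 1)/(GL) + (2/R)(G-1)/G$, which agrees with the tabulated values to within 0.001; the function name and the use of `None` to stand for $R \to \infty$ are choices made only for this illustration.

```python
def relvariance(beta, R, G=85, L=2):
    """Relvariance of the replication variance estimate, L = 2 form only."""
    if L != 2:
        raise ValueError("this sketch only covers the L = 2 form")
    grouped = (beta + 1) / (G * L)  # grouped-strata term, the R -> infinity limit
    if R is None:                   # None stands for R -> infinity
        return grouped
    return grouped + (2 / R) * (G - 1) / G  # added term due to replication


for R in (1, 2, 4, 10, 20, 40, None):
    row = [f"{relvariance(b, R):.3f}" for b in (3, 6, 10, 20)]
    print("inf" if R is None else R, *row)
```

Taking square roots converts a relvariance into a coefficient of variation of the variance estimate.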

H.   Interpretation  of  the  Table. 

1. To extend the use of the table to the whole 230 area sample, we use the fact that, for the replication estimates, the segments in the 60 self-representing areas are assigned at random to 16 pseudo-PSUs, which are grouped into eight pairs (L=2, G=8). About one-half of the population of the country is in the SR PSUs; consequently these 16 "strata" are approximately ten (= 170/16) times as large as the average NSR stratum. The sample in each of the 16 pseudo-PSUs is quite large, so that the distribution of sample estimates is probably nearly normal, and the average $\beta$ may be assumed to be close to 3.

The relvariance for the entire sample will be a weighted average of the numbers in Table 1 and a set of numbers for the SR strata, where $\beta = 3$ and the variance consists entirely of variance within SR PSUs. We find


(18) 


11 


%o^ 


m  170(^(i7o)  +  °i7o)+  l6(3ai6 +  °ie)t 


(I70a^70  +  I6ay2 


(19) 


The  resulting  rel-variance  will  be  smaller  than 
shown  in  Table  1,  but  may  not  be  much  smaller 
since  the  NSR  part  is  the  largest  part  of  the 
variance. 

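2. For most characteristics, the value of $\beta$ in Table 1 is probably between three and ten. This statement is based on the following considerations:

a) The $\beta$ for a single stratum involves fourth moments and variances both between and within PSUs.

$$\beta_{gh} = \frac{\mu_{4:gh}}{\sigma^4_{gh}} = \frac{\mu_{4B} + 6\sigma^2_B\sigma^2_W + \mu_{4W}}{(\sigma^2_B + \sigma^2_W)^2} = \frac{\sigma^4_B\beta_B + 6\sigma^2_B\sigma^2_W + \sigma^4_W\beta_W}{(\sigma^2_B + \sigma^2_W)^2} \qquad (20)$$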


b)   Studies  of  variances  for  the  230  area 
sample  indicate  that  for  a  sample  size 
of  25,000  households  the  within  variance 
is  over  two-thirds  of  the  total  variance 
for  most  items,  and  sometimes  is  as  much 
as  85  percent  of  the  total  variance. 

That  is,  of   is  of  the  order  of  two  to 

w 

four  times  erf:. 
B 

For  this  discussion  let  us  assume 


°S  =  H 


c)  In  a  stratum  where  the  characteristic 

being  estimated  is  not  highly  correlated 

with  the  measure  of  size,  the  p  between 

PSUs  may  be  quite  large;  in  a  particular 

case,  a  value  of  p  =  11.7  was  found; 
B 

a  p  =  35  was  found  for  a  statistic  for 
B 

which  nearly  all  the  variance  came  from 

one  stratum. 

We  use  P^  =  27. 
B 

d) The within PSU variance depends on the characteristic, and the number of segments in the sample. For an item which is 10 percent of the total population, the β for a sample of one element, with simple random sampling, is 8.111. We triple this number for clustering (clusters of six households, or 20 persons), and take

    \beta_1 = 24.


e) The β for a sample of n segments is given by

    \beta_n = \frac{\beta_1 + 3(n-1)}{n}.

For the 230 area sample there are 10 to 12 segments per sample PSU, in the NSR part of the sample.

Conservatively take n = 10, and find

    \beta_W = \frac{\beta_1 + 27}{10} = \frac{24 + 27}{10} = 5.1.

f) Using the preceding assumptions, and carrying them to averages over strata, we find by substitution into Eq. (20)

    \bar{\beta} = \frac{27 + 6(1 \times 3) + 5.1(3)^2}{(1+3)^2} = \frac{91}{16} = 5.7.

This β may be of the right order for a characteristic which is about 10 percent of the population.

g) For a characteristic which is 2 or 3 percent of the population, such as the number of unemployed persons, the β_W will be larger. For p = .03, the β_1 for a simple random sample of one person is about 30. If this is tripled for clustering, we find

    \beta_W = \frac{90 + 27}{10} = 11.7.

Moreover, for the number of the unemployed, \sigma_W^2 is close to 4\sigma_B^2. Assuming β_B to be 15, we find

    \bar{\beta} = \frac{15 + 6(1 \times 4) + 11.7(4)^2}{(1+4)^2} = \frac{226}{25} = 9.0.


h) In both of these computations a conservative position has been taken; the actual β's may be somewhat smaller, since the effect of averaging fourth moments and variances over strata is to remove the extreme observations. It may be difficult to find a bona fide situation in which the β is greater than ten, and a value as high as 20 is unlikely, unless almost all of the variance comes from a small number (one to five) of the strata.
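The arithmetic in items (d) through (g) can be checked with a short sketch. The function names are illustrative and do not come from the source. The formula (1 - 3pq)/(pq) for a 0-1 variable is an assumption, chosen because it reproduces the 8.111 quoted in (d). The inputs are the rounded values used in the text.

```python
def bernoulli_beta(p):
    """Fourth moment over squared variance for a 0-1 variable with mean p
    (assumed form; it gives 8.111 for p = 0.10, as quoted in item (d))."""
    q = 1.0 - p
    return (1.0 - 3.0 * p * q) / (p * q)


def beta_for_n_segments(beta_1, n):
    """Item (e): beta for a sample of n segments, given beta_1 for one segment."""
    return (beta_1 + 3.0 * (n - 1)) / n


def combined_beta(beta_b, beta_w, var_b, var_w):
    """Eq. (20): beta for the sum of a between-PSU and a within-PSU component."""
    numerator = beta_b * var_b ** 2 + 6.0 * var_b * var_w + beta_w * var_w ** 2
    return numerator / (var_b + var_w) ** 2


# Item (f): characteristic that is about 10 percent of the population.
beta_w_10 = beta_for_n_segments(24, 10)
print(round(beta_w_10, 1), round(combined_beta(27, beta_w_10, 1, 3), 1))

# Item (g): unemployment-like characteristic, p about .03.
beta_w_03 = beta_for_n_segments(90, 10)
print(round(beta_w_03, 1), round(combined_beta(15, beta_w_03, 1, 4), 1))
```

Only the ratio of the within to the between variance matters in `combined_beta`, which is why the text can set the between variance to 1.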

I.   Conclusion 

Table 1 indicates that there is quite an increase in the rel-variance of the estimated variance, due to the use of the replication method. For example, for β = 6, the coefficient of variation of 21 percent (for the grouped stratum estimate) is increased to 30 percent when 40 replications are used, and to 38 percent when 20 replications are used to estimate the variance. These increases are for the NSR part of the 230 area sample only.


For the whole 230 area sample, the numbers will be somewhat smaller. If the sample is enlarged to 357 areas, with an increased sample size (to 35,000 households) the coefficient of variation of the estimate of the variance, for β = 6, and 20 replications, may be about .33. For 40 replications the coefficient of variation for β = 6 will be about .25.

We  conclude  that  20  replications  will  give 
serviceable  estimates  of  the  variance  for  most 
items. 

This  discussion  has  assumed,  essentially,  a 
straightforward  unbiased  estimation  procedure 
for estimating a total. For the estimation of a ratio, or the use of a ratio or composite estimate, the results obtained here may serve as a guide in evaluating the reliability of the estimate of the variance.


MCCARTHY'S  ORTHOGONAL  REPLICATIONS  FOR  ESTIMATING  VARIANCES,  WITH 

GROUPED  STRATA 

Margaret  Gurney 


During a visit to the Census Bureau, Professor Phillip J. McCarthy of Cornell University mentioned a device for selecting replications which gives the full number of degrees of freedom of the grouped stratum method for computing variances.¹ That is, instead of requiring 2^n replications for n groups of strata (each consisting of a pair of original strata) we need only 4K replications, where 4K is the smallest number divisible by 4 that is greater than or equal to n (so that 4K ≤ n+3).

The device used is to select orthogonal replications so that the between stratum variance introduced by each replication will cancel across the set of 4K replications.

ILLUSTRATION 

Consider  6  strata,  which  have  been  grouped 
into  3  groups 

I  consists  of  strata  1,2 
II consists of strata 3,4
III  consists  of  strata  5,6. 

An estimate of total from the whole sample is

    x' = x_1 + x_2 + x_3 + x_4 + x_5 + x_6.

One of the replications involves estimates (x_1, x_3, x_5); the estimate of total from this replication is

    x'_\alpha = 2(x_1 + x_3 + x_5).

The estimate of variance from this one replication is

    [2(x_1 + x_3 + x_5) - (x_1 + x_2 + x_3 + x_4 + x_5 + x_6)]^2
        = (x_1 - x_2)^2 + (x_3 - x_4)^2 + (x_5 - x_6)^2
        + 2[(x_1 - x_2)(x_3 - x_4) + (x_1 - x_2)(x_5 - x_6)
        + (x_3 - x_4)(x_5 - x_6)]        (1)


The last two lines of equation (1) show the additional terms in the estimate of variance coming from using replication (α). Some other replication, e.g. (β):

    x'_\beta = 2(x_1 + x_4 + x_6)

will lead to other terms which are added to the estimate of variance; e.g.,

    2[-(x_1 - x_2)(x_3 - x_4) - (x_1 - x_2)(x_5 - x_6)
        + (x_3 - x_4)(x_5 - x_6)]        (2)


For 3 groups of strata the following choice of 4 replications will give added cross-product terms which all cancel; the set of replications is said to be "orthogonal."


                      Stratum group
    Replication     I       II      III

        1          x_1     x_3     x_5
        2          x_1     x_4     x_6
        3          x_2     x_3     x_6
        4          x_2     x_4     x_5
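A small numerical check may make the cancellation concrete. The sketch below uses the four replications listed in the text for 3 groups of strata, with made-up stratum estimates. The names `PATTERN`, `replicate_variance` and `grouped_stratum_variance` are illustrative only.

```python
import random

# Rows are replications. +1 picks the odd-numbered stratum of a group,
# -1 the even-numbered one, for groups I, II, III.
PATTERN = [
    (+1, +1, +1),  # x1, x3, x5
    (+1, -1, -1),  # x1, x4, x6
    (-1, +1, -1),  # x2, x3, x6
    (-1, -1, +1),  # x2, x4, x5
]


def replicate_variance(x, signs):
    """Squared deviation of twice the half-sample total from the full total."""
    total = sum(x)
    half = sum(x[2 * g] if s > 0 else x[2 * g + 1] for g, s in enumerate(signs))
    return (2 * half - total) ** 2


def grouped_stratum_variance(x):
    """Sum of squared within-group differences, with no cross products."""
    return sum((x[2 * g] - x[2 * g + 1]) ** 2 for g in range(len(x) // 2))


random.seed(0)
x = [random.uniform(50, 150) for _ in range(6)]  # hypothetical stratum estimates
avg = sum(replicate_variance(x, s) for s in PATTERN) / len(PATTERN)
print(avg, grouped_stratum_variance(x))  # the two agree up to rounding
```

A single replication generally differs from the grouped stratum value by its cross-product terms. Only the average over the orthogonal set removes them.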

¹The replication method (sometimes referred to as the balanced half-sample technique) was subsequently described by Professor McCarthy in publications of the National Center for Health Statistics (Series 2, Nos. 14 and 31).
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Symbolically,  let  +  indicate  the  odd 
subscript  in  a  pair  of  strata,  and  -  indicate 
the  even  subscript.  Then  the  orthogonal 
pattern for 2, 3, or 4 grouped strata is


                              Stratum group

                    2 groups of       3 or 4 groups
                      strata            of strata
    Replication     I       II        III      IV

        1           +       +         +        +
        2           +       -         -        +
        3           -       +         -        +
        4           -       -         +        +

In this pattern each pair of columns has 2 replications for which the signs are the same (both + or both -); and 2 replications for which the signs are different. In this illustration the pattern for 4 strata is symmetric about the minor diagonal; this however is not necessary. The following pattern would also be all right:

+  +  +  - 

+  -  -  - 

GENERALIZATION  TO  MORE  STRATA 

To generalize to 5 strata, we need 8 replications (always a multiple of 4); these are selected from 2^5 = 32 possible replications. For 6, 7 or 8 strata, we still need only 8 replications, selected from 64, 128 or 256 possible replications.

The matrix for the extension to 5, 6, 7, 8 strata is shown below. The method, in going from 4 to 8 strata is

(1) Copy the 4 columns of the original matrix to columns 5, 6, 7, and 8, on lines 1 to 4.

(2) Copy the original matrix to lines 5 to 8, for columns 1 to 4.

(3) Fill the remaining 4 x 4 matrix (columns 5 to 8, lines 5 to 8) by the complementary matrix, obtained by changing (+) to (-) in the 4 x 4 matrix.


                                    Stratum group

                    1 to 4 groups of strata     5 to 8 groups of strata
    Replication     I     II    III   IV        V     VI    VII   VIII

        1           +     +     +     +         +     +     +     +
        2           +     -     -     +         +     -     -     +
        3           -     +     -     +         -     +     -     +
        4           -     -     +     +         -     -     +     +
        5           +     +     +     +         -     -     -     -
        6           +     -     -     +         -     +     +     -
        7           -     +     -     +         +     -     +     -
        8           -     -     +     +         +     +     -     -

Here, again, any two columns have a pattern of signs (+ and -) such that the product of the signs row by row leads to the same number of plus signs (4) as minus signs (4).

The  generalization  to  9  or  more  groups  of 
strata  can  be  made  easily. 
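The three copying steps can be sketched as a doubling rule applied repeatedly. The 4 x 4 starting matrix `BASE` below is one valid orthogonal pattern and is an assumption; the printed pattern may differ in row or column order. The function names are illustrative. Doubling from 4 only reaches sets of 4, 8, 16, ... replications, so a set whose size is some other multiple of 4 would need a different construction, which is not shown here.

```python
BASE = [
    [+1, +1, +1, +1],
    [+1, -1, -1, +1],
    [-1, +1, -1, +1],
    [-1, -1, +1, +1],
]


def double(m):
    """Steps (1)-(3): place M beside M, then M beside its complement."""
    top = [row + row for row in m]
    bottom = [row + [-v for v in row] for row in m]
    return top + bottom


def pattern_for(n_groups):
    """Double the base until there is one column per group of strata."""
    m = BASE
    while len(m) < n_groups:
        m = double(m)
    return [row[:n_groups] for row in m]


def columns_orthogonal(m):
    """True if every pair of columns agrees in exactly half of the rows."""
    cols = list(zip(*m))
    return all(
        sum(a * b for a, b in zip(cols[i], cols[j])) == 0
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
    )


for n in (3, 8, 9):
    p = pattern_for(n)
    print(n, len(p), columns_orthogonal(p))
```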

VARIANCE  OF  THE  ESTIMATE 

Notation: 

Let us represent the difference of the two stratum estimates in the i-th group by d_i:

    d_1 = x_1 - x_2
    d_2 = x_3 - x_4
    ...
    d_n = x_{2n-1} - x_{2n}.

Also let

    D_{ij} = d_i d_j.

Then the additional terms in equation (1) may be written as

    2(D_{12} + D_{13} + D_{23}).

Generalizations:

For n groups of strata, the additional terms from a single replication are of the form

    D_\alpha = 2(\pm D_{12} \pm D_{13} \pm \cdots \pm D_{n-1,n})        (3)

where the signs are determined by the particular replication.

For a set of orthogonal replications (and also over all replications) we have

    E(D_\alpha) = 0.        (4)
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Variance : 

The variance of the additional terms, which is the increase in the variance of the estimate of the variance due to the use of replications, is given below.

(a) For one replication

    \sigma_{D_\alpha}^2 = E(D_\alpha - E D_\alpha)^2 = E D_\alpha^2.        (5)

It can be shown that for an orthogonal set the signed D_{ij} are also orthogonal, so that

    E(D_{ij} D_{kl}) = 0  for (i,j) \ne (k,l)        (6)

and

    \sigma_{D_\alpha}^2 = E D_\alpha^2 = 4 \sum_{i<j} D_{ij}^2.        (7)

(b) For k replications the additional terms are

    A_k = \frac{1}{k} \sum_{\alpha=1}^{k} D_\alpha

and the variance of A_k is

    \sigma_{A_k}^2 = \frac{4K - k}{4K - 1} \cdot \frac{1}{k} \sigma_{D_\alpha}^2        (8)

where 4K is the smallest number, divisible by 4, which is greater than or equal to n.

COMPARISON  WITH  RANDOM  SELECTION  OF 
REPLICATIONS 

If the replications are selected at random, the added terms for any one replication are still given by equation (3), and the variance for one replication is still the same as that in equation (7). But the number of distinct possible samples is now 2^{n-1}. (Note that for 3 strata the replication x_1, x_3, x_5 has the same increase as x_2, x_4, x_6, except for sign, so that there are 2^{3-1} = 4 distinct replications, instead of 8, for variance purposes.)

The variance in the unrestricted case is

    \frac{2^{n-1} - k}{2^{n-1} - 1} \cdot \frac{1}{k} \sigma_{D_\alpha}^2.        (9)


Consequently, the variance of a sample of size k selected from an orthogonal set of replications is smaller than the variance of a sample of the same size from the whole set of replications, by a factor of

    f = \frac{4K - k}{4K - 1} \Big/ \frac{2^{n-1} - k}{2^{n-1} - 1}.        (10)

If the number of groups n is large, the factor becomes

    f \approx \frac{4K - k}{4K - 1}        (11)

which is the finite multiplier for the orthogonal case.

The factor will measure a significant gain if k is not too small relative to 4K. For example, with the 230-area Current Population Survey (CPS) sample, we had

    n = 85
    4K = 88
    k = 20 (or 40)

so that

    f = 68/87 = .782 (or 48/87 = .552).

With 357 CPS areas, there are 245 non-self-representing (NSR) primary sampling units (PSU's) grouped into about 120 strata:

    n = 120
    k = 20

    f = 100/119 = .840 = 84%.
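The factors quoted for the two CPS designs follow directly from equations (10) and (11). The sketch below reproduces them; the function names are illustrative only.

```python
def smallest_multiple_of_4(n):
    """4K: the smallest number divisible by 4 that is at least n."""
    return 4 * ((n + 3) // 4)


def finite_multiplier(set_size, k):
    """Finite multiplier for k replications drawn from a set of set_size."""
    return (set_size - k) / (set_size - 1)


def gain_factor(n, k, exact=False):
    """Eq. (11) by default; eq. (10) when exact=True."""
    f = finite_multiplier(smallest_multiple_of_4(n), k)
    if exact:
        # Divide by the multiplier for the 2^(n-1) distinct replications.
        f /= finite_multiplier(2 ** (n - 1), k)
    return f


for n, k in [(85, 20), (85, 40), (120, 20), (120, 40)]:
    print(n, k, smallest_multiple_of_4(n), round(gain_factor(n, k), 3))
```

For n as large as 85 the divisor in equation (10) is indistinguishable from 1, which is why the text works with equation (11).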

CONCLUSION 

A factor of 84 percent in the variance of the estimate of the variance may (or may not) warrant the setting up of 120 orthogonal replications for the present CPS sample, and then selecting 20 of them for computing variances.

If the number of replications were 40, the factor would be

    f = 80/119 = 67%

which would surely justify the use of orthogonal replications.
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This  table  shows  the  reduction  in  variance 
for samples of 20 and 40 orthogonal replications, for various numbers of grouped strata.


VARIANCE OF A SAMPLE OF ORTHOGONAL REPLICATIONS AS A PERCENT OF THE VARIANCE OF THE UNRESTRICTED REPLICATION METHOD

    Number of           Number of samples from a set of
    grouped strata      orthogonal replications (percent)

                            20            40

         40                 51%
         85                 78            55%
        120                 84            67
    (illegible)             90            80

AN  ESTIMATOR  OF  BETWEEN-RATER  VARIANCE  APPLICABLE  TO  THE 
LOUISVILLE  EXPERIMENTS  ON  CONDITION  OF  HOUSING 

Leon  Pritzker 


The  purposes  of  this  memorandum  are  to 
record  an  estimator  and  to  provide  its 
rationale. 

We  consider  first  an  experiment  in  which 
there  is  no  subsampling  of  housing  units  within 
a  block;  i.e.,  a  stratified  sample  of  blocks  is 
drawn  and  all  the  housing  units  in  each  block 
are  rated  twice.  There  are  two  raters  assigned 
to  each  block,  and  they  make  their  ratings 
independently  of  each  other. 

Let

    X_{sbhe} = 1, if rater e rates the h-th housing unit in the b-th block of stratum s as "dilapidated."
             = 0, if the rating is "not dilapidated."

    N_{sb} = total number of housing units in block b of stratum s.

    X_{sbe} = \sum_{h=1}^{N_{sb}} X_{sbhe} = number of housing units in block b of stratum s rated as "dilapidated" by rater e.

    D_{sb} = (X_{sbe} - X_{sb}) - (X_{sbe'} - X_{sb}) = X_{sbe} - X_{sbe'} = difference in the number of housing units rated as "dilapidated" by two raters in block b of stratum s.

Then, on the assumption that the two raters are a random sample drawn from a large universe of raters, it can be shown that the between-rater variance of an estimate of the total number of dilapidated houses in a block, for a fixed block, is

    \sigma_{sb}^2 = \tfrac{1}{2} E D_{sb}^2        (1)

where the expectation is taken over all pairs of raters.

Now  consider  an  experiment  in  which  a 
subsample  of  housing  units  is  drawn  within  each 
sampled  block  and  the  two  ratings  are  made  for 
each  of  the  subsampled  units  as  above: 

Let

    d_{sbh} = x_{sbhe} - x_{sbhe'} = difference between the ratings of two raters for the h-th subsampled housing unit in block b of stratum s, with respect to the class "dilapidated."

    n_{sb} = number of subsampled housing units in block b of stratum s.

    d_{sb} = \sum_{h=1}^{n_{sb}} d_{sbh} = difference in the number of subsampled units rated as "dilapidated" by the two raters in block b of stratum s.

    f_{sb} = n_{sb} / N_{sb} = sampling fraction for the subsample of housing units in block b of stratum s.

    d'_{sb} = d_{sb} / f_{sb} (This will be used as an estimator of D_{sb}. However, see the last paragraph of this memorandum.)

We define the sampling deviation:

    \Delta_{sb} = d'_{sb} - D_{sb}.        (2)

Then:

    d'_{sb} = D_{sb} + \Delta_{sb}.        (3)
The question arises: What is the relationship between what we want, namely \tfrac{1}{2} E D_{sb}^2, and what we can get, namely \tfrac{1}{2} d'^2_{sb}? From (3)

    \tfrac{1}{2} E d'^2_{sb} = \tfrac{1}{2} E [D_{sb} + \Delta_{sb}]^2        (4)

Neglecting the interaction between response and sampling deviations:

    \tfrac{1}{2} E d'^2_{sb} = \tfrac{1}{2} E [D_{sb}^2 + \Delta_{sb}^2]        (5)

Thus:

    \tfrac{1}{2} E D_{sb}^2 = \tfrac{1}{2} E [d'^2_{sb} - \Delta_{sb}^2]        (6)


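We have to subtract an estimate of the within-block sampling variance from d'^2_{sb} to obtain an estimate of the desired between-rater variance. From the definitions:

    E \Delta_{sb}^2 = E \left[ \frac{N_{sb}}{n_{sb}} \sum_{h=1}^{n_{sb}} d_{sbh} - \sum_{h=1}^{N_{sb}} d_{sbh} \right]^2        (7)

Converting totals to means, with \bar{d}_{sb} = d_{sb} / n_{sb} and \bar{D}_{sb} = D_{sb} / N_{sb}:

    E \Delta_{sb}^2 = N_{sb}^2 E [\bar{d}_{sb} - \bar{D}_{sb}]^2        (8)

Assuming simple random sampling without replacement, we can write the estimator, \hat{\Delta}_{sb}^2:

    \hat{\Delta}_{sb}^2 = \frac{N_{sb}^2}{n_{sb}} \cdot \frac{N_{sb} - n_{sb}}{N_{sb}} \cdot \frac{\sum_{h=1}^{n_{sb}} (d_{sbh} - \bar{d}_{sb})^2}{n_{sb} - 1}        (9)

                        = \frac{n_{sb}(1 - f_{sb})}{f_{sb}^2} \cdot \frac{\sum_{h=1}^{n_{sb}} (d_{sbh} - \bar{d}_{sb})^2}{n_{sb} - 1}        (10)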

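Thus, the desired estimator of the between-rater variance for a single block is:

    \hat{\varphi}_{sb} = \frac{1}{2} \left[ d'^2_{sb} - \frac{n_{sb}(1 - f_{sb})}{f_{sb}^2} \cdot \frac{\sum_{h=1}^{n_{sb}} (d_{sbh} - \bar{d}_{sb})^2}{n_{sb} - 1} \right]        (11)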
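As an illustration, the single-block estimator can be computed from the per-unit differences between the two raters. This is a sketch under stated assumptions: the function name and the example data are hypothetical, and the factor 1/2 is taken from the definition of the between-rater variance in equation (1).

```python
def between_rater_variance_block(d, N):
    """Estimator of eq. (11) for one block.

    d : per-unit rating differences (each -1, 0 or +1) for the n
        subsampled housing units.
    N : total number of housing units in the block.
    """
    n = len(d)
    if n < 2 or N < n:
        raise ValueError("need 2 <= n <= N")
    f = n / N                                  # sampling fraction
    d_total = sum(d)                           # net difference in the subsample
    d_prime = d_total / f                      # expanded net difference d'
    d_bar = d_total / n
    ss = sum((v - d_bar) ** 2 for v in d)
    delta2_hat = n * (1 - f) / f ** 2 * ss / (n - 1)   # eq. (10)
    return 0.5 * (d_prime ** 2 - delta2_hat)


# Hypothetical block: 40 units, 10 subsampled, raters disagree on 4 of them.
d = [1, 0, 0, -1, 1, 0, 0, 0, 1, 0]
print(between_rater_variance_block(d, N=40))
```

Because a sampling-variance estimate is subtracted, the result can be negative for an individual block.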


To obtain an estimator of the average between-rater variance (\varphi) for a stratified sample of blocks, where blocks in the same stratum have the same probability of selection but the probability varies from stratum to stratum:

Let

    m_s = number of sampled blocks in stratum s

    w_s = weight for stratum s

    L = number of strata.

Then:

    \hat{\varphi} = \frac{\sum_{s=1}^{L} w_s \sum_{b} \hat{\varphi}_{sb}}{\sum_{s=1}^{L} w_s m_s}        (12)


There is a defect in the estimator, however. It can be shown that D_{sb}^2 is a function of the intraclass correlation among deviations from the average by each rater, the correlation that reflects a systematic tendency on the part of a rater to rate more leniently or more severely than the average rater. This intraclass correlation varies inversely with the number of units actually rated by each rater. Thus, even apart from sampling error, d'^2_{sb} will be a biased estimator of D_{sb}^2 because the raters producing the values, d'_{sb}, rate a subsample of units in each block. In the Louisville experiments, the raters worked with subsamples of less than one-fourth, in general, of the housing units.


THE  RESEARCH  TRIANGLE  INSTITUTE  ESTIMATOR  OF 
BETWEEN-RATER  VARIANCE 

Leon  Pritzker 


The purposes of this note are to record the Research Triangle Institute (RTI) estimator of between-rater variance and to demonstrate that the two available estimators of between-rater variance are equivalent. The two estimators are developed in the documents cited below. The same symbols are used except that the subscripts indicating stratum (s) and block (b) are not shown since the demonstration applies to a single block.

The RTI estimator is based on the following tabulation of the number of ratings obtained by two raters:


                               Second rater
    First rater           Dilapidated     Not dilapidated

    Dilapidated                a                b
    Not dilapidated            c                d

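The estimator is

    \hat{\varphi}_1 = \frac{1}{2} \cdot \frac{N(N-1)(b-c)^2 - N(N-n)(b+c)}{n(n-1)}

where b and c are defined in the table above, and N and n are defined in the preceding document.

The estimator in the preceding memorandum may be written as:

    \hat{\varphi}_2 = \frac{1}{2} \left[ d'^2 - \hat{\Delta}^2 \right]

The net difference in the sample count of dilapidated units may be written as d or, in the RTI notation, as b-c. Thus:

    d'^2 = \frac{N^2}{n^2} (b-c)^2

From equation (9) of the preceding document

    \hat{\Delta}^2 = \frac{N^2}{n} \cdot \frac{N-n}{N} \cdot \frac{\sum_{h=1}^{n} (d_h - \bar{d})^2}{n-1}
                   = \frac{N(N-n)}{n(n-1)} \left[ \sum_{h=1}^{n} d_h^2 - n \bar{d}^2 \right]

It can be shown that the gross difference in the count may be written as

    \sum_{h=1}^{n} d_h^2

or, in the RTI notation, as b+c. Thus:

    \hat{\Delta}^2 = \frac{N(N-n)}{n(n-1)} \left[ (b+c) - \frac{(b-c)^2}{n} \right]

And the estimator may then be written in the RTI notation as:

    \hat{\varphi}_2 = \frac{1}{2} \left[ \frac{N^2}{n^2} (b-c)^2 + \frac{N(N-n)}{n^2(n-1)} (b-c)^2 - \frac{N(N-n)}{n(n-1)} (b+c) \right]

                    = \frac{1}{2} \left[ \frac{N^2(n-1) + N(N-n)}{n^2(n-1)} (b-c)^2 - \frac{N(N-n)}{n(n-1)} (b+c) \right]

                    = \frac{1}{2} \cdot \frac{N(N-1)(b-c)^2 - N(N-n)(b+c)}{n(n-1)}

                    = \hat{\varphi}_1

which is the RTI estimator.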
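The equivalence can also be checked numerically over every admissible pair of disagreement counts. In this sketch the function names are illustrative, and the common factor 1/2 is an assumption carried over from the definition of the between-rater variance in the preceding memorandum. It does not affect the equivalence.

```python
from itertools import product


def rti_estimator(b, c, n, N):
    """RTI form, written in the disagreement counts b and c."""
    return 0.5 * (N * (N - 1) * (b - c) ** 2 - N * (N - n) * (b + c)) / (n * (n - 1))


def memo_estimator(b, c, n, N):
    """Form of the preceding memorandum: half of d'^2 minus the sampling variance."""
    d_prime_sq = (N / n) ** 2 * (b - c) ** 2
    delta2_hat = N * (N - n) / (n * (n - 1)) * ((b + c) - (b - c) ** 2 / n)
    return 0.5 * (d_prime_sq - delta2_hat)


def max_discrepancy(n, N):
    """Largest absolute difference over all b, c with b + c <= n."""
    worst = 0.0
    for b, c in product(range(n + 1), repeat=2):
        if b + c <= n:
            gap = abs(rti_estimator(b, c, n, N) - memo_estimator(b, c, n, N))
            worst = max(worst, gap)
    return worst


print(max_discrepancy(n=12, N=50))  # rounding error only
```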


"""This  note  is  to  be  read  in  conjunction  with 
the  preceding  one,  "An  Estimator  of  Between-Rater 
Variance  Applicable  to  the  Louisville  Experiments 
on  Condition  of  Housing."  It  is  based  on  comments 
by Walter A. Hendricks, Research Triangle Institute, on a draft of the previous memorandum.
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